1. Citation

Citation: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

2. About dataset

The scope of this analysis is to understand how the various physicochemical parameters relate to the quality ratings of both red and white wine. The data sets used for the analysis were downloaded from:

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv

3. Number of Instances:

red wine - 1599; white wine - 4898.

4. Number of Attributes:

11 + output attribute

5. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

6. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Load Packages

# Packages used in this EDA
library(ggplot2)
library(gridExtra)
## Loading required package: grid
library(GGally)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:GGally':
## 
##     nasa
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%

Set chunk global options

Load Data

## [1] 6497   14
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"
## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : chr  "red" "red" "red" "red" ...
##        X        fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1   Min.   : 3.80   Min.   :0.08     Min.   :0.000  
##  1st Qu.: 813   1st Qu.: 6.40   1st Qu.:0.23     1st Qu.:0.250  
##  Median :1650   Median : 7.00   Median :0.29     Median :0.310  
##  Mean   :2044   Mean   : 7.21   Mean   :0.34     Mean   :0.319  
##  3rd Qu.:3274   3rd Qu.: 7.70   3rd Qu.:0.40     3rd Qu.:0.390  
##  Max.   :4898   Max.   :15.90   Max.   :1.58     Max.   :1.660  
##  residual.sugar    chlorides     free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.60   Min.   :0.009   Min.   :  1.0       Min.   :  6         
##  1st Qu.: 1.80   1st Qu.:0.038   1st Qu.: 17.0       1st Qu.: 77         
##  Median : 3.00   Median :0.047   Median : 29.0       Median :118         
##  Mean   : 5.44   Mean   :0.056   Mean   : 30.5       Mean   :116         
##  3rd Qu.: 8.10   3rd Qu.:0.065   3rd Qu.: 41.0       3rd Qu.:156         
##  Max.   :65.80   Max.   :0.611   Max.   :289.0       Max.   :440         
##     density            pH         sulphates        alcohol    
##  Min.   :0.987   Min.   :2.72   Min.   :0.220   Min.   : 8.0  
##  1st Qu.:0.992   1st Qu.:3.11   1st Qu.:0.430   1st Qu.: 9.5  
##  Median :0.995   Median :3.21   Median :0.510   Median :10.3  
##  Mean   :0.995   Mean   :3.22   Mean   :0.531   Mean   :10.5  
##  3rd Qu.:0.997   3rd Qu.:3.32   3rd Qu.:0.600   3rd Qu.:11.3  
##  Max.   :1.039   Max.   :4.01   Max.   :2.000   Max.   :14.9  
##     quality        color          
##  Min.   :3.00   Length:6497       
##  1st Qu.:5.00   Class :character  
##  Median :6.00   Mode  :character  
##  Mean   :5.82                     
##  3rd Qu.:6.00                     
##  Max.   :9.00
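
The loading code itself is not shown in this report. Below is a minimal sketch of how a combined data frame like the one above could be built, assuming the two CSV files are read separately and tagged with a color column before being stacked. The helper name combine_wines and the tiny stand-in data frames are illustrative only, not the original code.

```r
# Hypothetical helper: tag each data frame with its wine color and stack them.
combine_wines <- function(reds, whites) {
  reds$color   <- "red"
  whites$color <- "white"
  rbind(reds, whites)
}

# With the real files this would be (file names taken from the download URLs):
#   reds   <- read.csv("wineQualityReds.csv")
#   whites <- read.csv("wineQualityWhites.csv")
# Tiny stand-in frames so the sketch runs on its own (values are illustrative).
reds   <- data.frame(X = 1:2, alcohol = c(9.4, 9.8),       quality = c(5L, 5L))
whites <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

wine <- combine_wines(reds, whites)

dim(wine)      # rows = reds + whites, columns = original columns + color
names(wine)
str(wine)
summary(wine)
```

Because each file keeps its own row index X, the combined X column restarts at 1 for the white wines, which would explain why its maximum above is 4898 rather than 6497.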

Observations from the summary

1. The alcohol content varies from 8.00 to 14.90 for the samples in the dataset.
2. The quality of the samples ranges from 3 to 9, with a median of 6 and a mean of 5.818.
3. The range for fixed acidity is quite wide, with a minimum of 3.8 and a maximum of 15.9.
4. The pH value varies from 2.720 to 4.010, with a mean of 3.219 and a median of 3.210.
5. The mean residual sugar is 5.443 but the max value is 65.800, indicating an outlier.
6. free.sulfur.dioxide has a mean of 30.53 and a high of 289.0.

Univariate Plots and Analysis Section

Analysis of all the single variables using plots

Fixed.acidity analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.80    6.30    6.80    6.85    7.30   14.20
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

plot of chunk fixed.acidity
Observation about fixed acidity of wine:

Fixed.acidity, also known as titratable acidity, comes from acids that either occur naturally in the grapes or are created through the fermentation process. Red wine seems to be more acidic than white wine, as can be seen from the min, mean and max values of fixed.acidity in the two summaries above (white first, then red). (http://winemakersacademy.com/understanding-wine-acidity)
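
The per-color summaries can be produced by splitting the variable on the color column. This is a small sketch, assuming a data frame with fixed.acidity and color columns as in the structure listing earlier; the toy values below are illustrative and not the full dataset.

```r
# Toy stand-in for the combined data frame (illustrative values only)
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.0, 6.3, 8.1),
  color         = c("red", "red", "red", "white", "white", "white")
)

# Split the variable by color, then summarise each group
groups    <- split(wine$fixed.acidity, wine$color)
summaries <- lapply(groups, summary)
summaries$white
summaries$red

# Group means, handy for a quick red vs. white comparison
means <- sapply(groups, mean)
means
```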

Volatile.acidity analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.08    0.23    0.29    0.34    0.40    1.58
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.120   0.390   0.520   0.528   0.640   1.580
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.080   0.210   0.260   0.278   0.320   1.100

The volatile.acidity distribution is slightly skewed: even though the mean is 0.3397, the max is 1.58. This is because of the red wine max value of 1.58, so scale_x_log10 is used to analyze this further.

plot of chunk volatile.acidity_log10
Observation about Volatile acidity of wine:

The majority of the volatile.acidity values seem to be between 0.23 and 0.78. Our palates are quite sensitive to the presence of volatile acids, and for that reason their concentrations should be as low as possible.
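
To see why a log10 scale helps with a right-skewed variable, here is a small sketch that compares a moment-based skewness before and after the transform. The skewness helper and the toy values are assumptions for illustration; the report itself used scale_x_log10 on the ggplot2 histogram.

```r
# Hypothetical helper: simple moment-based skewness (0 for a symmetric sample)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy right-skewed values standing in for wine$volatile.acidity (illustrative)
volatile.acidity <- c(0.08, 0.21, 0.23, 0.26, 0.29, 0.32, 0.40, 0.52, 0.64, 1.58)

raw_skew <- skewness(volatile.acidity)          # long right tail
log_skew <- skewness(log10(volatile.acidity))   # tail compressed on the log scale
c(raw = raw_skew, log10 = log_skew)

# Base-graphics histogram on the log10 scale
pdf(NULL)  # discard graphics output when run non-interactively
hist(log10(volatile.acidity), main = "log10(volatile.acidity)",
     xlab = "log10(g / dm^3)")
invisible(dev.off())
```

With ggplot2 loaded, the equivalent would be a histogram of volatile.acidity with scale_x_log10() added, which transforms the axis rather than the data.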

Citric.acid analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.250   0.310   0.319   0.390   1.660
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.334   0.390   1.660

The citric.acid distribution is slightly skewed, so scale_x_log10 is used to analyze this further. This may be because, even though the mean is 0.271 for red and 0.334 for white, there are some outliers.

plot of chunk citric.acid.log10
Observation about Citric acidity of wine:

From the summary, the min is 0.0, and the graph shows the number of wines at 0.0 to be somewhere between 0 and 250, so I wanted to see how many were either not reported or had a value of 0.

## [1] 151

There are 151 observations with a value of 0.
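
A count like this can be obtained with a logical comparison. The sketch below uses the first ten citric.acid values from the structure listing above, so the count differs from the full-data figure. It also shows why zeros matter on a log10 axis: log10(0) is -Inf, so those observations cannot be drawn on it.

```r
# First ten citric.acid values from the str() output above
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero    <- sum(citric.acid == 0)     # observations recorded as exactly 0
n_missing <- sum(is.na(citric.acid))   # observations not reported at all
c(zero = n_zero, missing = n_missing)

# Zeros become -Inf on a log10 scale and drop out of a log-scaled plot
n_plottable <- sum(is.finite(log10(citric.acid)))
n_plottable
```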

Residual.sugar analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.60    1.80    3.00    5.44    8.10   65.80
## 
##   0.6   0.7   0.8   0.9  0.95     1  1.05   1.1  1.15   1.2  1.25   1.3 
##     2     7    25    39     4    93     1   146     3   187     3   147 
##  1.35   1.4  1.45   1.5  1.55   1.6  1.65   1.7  1.75   1.8  1.85   1.9 
##     2   184     4   142     2   165     2    99     1    99     3    59 
##  1.95     2  2.05   2.1   2.2  2.25   2.3  2.35   2.4   2.5   2.6  2.65 
##     2    79     1    51    56     2    42     1    41    40    33     1 
##   2.7   2.8  2.85   2.9     3   3.1  3.15   3.2   3.3   3.4   3.5   3.6 
##    38    36     1    25    17    17     1    28    23    13    31    22 
##   3.7  3.75   3.8  3.85   3.9  3.95     4   4.1   4.2  4.25   4.3  4.35 
##    12     2    21     3    17     3    19    17    31     2    19     1 
##   4.4  4.45   4.5  4.55   4.6   4.7  4.75   4.8  4.85   4.9     5   5.1 
##    14     3    33     2    40    29     5    38     1    35    43    28 
##  5.15   5.2  5.25   5.3  5.35   5.4  5.45   5.5  5.55   5.6   5.7   5.8 
##     2    29     4    17     2    23     2    13     1    16    30    23 
##  5.85   5.9  5.95     6   6.1   6.2   6.3  6.35   6.4   6.5  6.55   6.6 
##     2    19     1    23    21    31    39     1    34    26     1    30 
##  6.65   6.7  6.75   6.8  6.85   6.9  6.95     7  7.05   7.1   7.2  7.25 
##     3    25     1    28     6    20     1    31     2    36    29     2 
##   7.3  7.35   7.4  7.45   7.5   7.6   7.7  7.75   7.8  7.85   7.9  7.95 
##    19     2    40     1    30    29    34     2    41     1    32     1 
##     8   8.1  8.15   8.2  8.25   8.3   8.4  8.45   8.5  8.55   8.6  8.65 
##    32    34     1    36     2    31    13     1    24     1    27     1 
##   8.7  8.75   8.8   8.9  8.95     9  9.05   9.1  9.15   9.2  9.25   9.3 
##    18     2    22    23     1    18     1    17     2    22     2    11 
##   9.4   9.5  9.55   9.6  9.65   9.7   9.8  9.85   9.9    10 10.05  10.1 
##    10     9     1    18     4    22    16     3    18    18     3    14 
##  10.2  10.3  10.4  10.5 10.55  10.6 10.65  10.7  10.8  10.9    11  11.1 
##    23    16    25    16     1    22     1    26    17    11    19    18 
##  11.2 11.25  11.3  11.4 11.45  11.5  11.6  11.7 11.75  11.8  11.9 11.95 
##    18     2    12    14     1    11    15     8     4    35    16     3 
##    12 12.05  12.1 12.15  12.2  12.3  12.4  12.5 12.55  12.6  12.7 12.75 
##    16     1    21     4    15    13    19    16     2    16    16     1 
##  12.8 12.85  12.9    13  13.1 13.15  13.2  13.3  13.4  13.5 13.55  13.6 
##    25     4    25    19    23     1    13    16     7    10     3    12 
## 13.65  13.7  13.8  13.9    14 14.05  14.1 14.15  14.2  14.3 14.35  14.4 
##     4    21     8    18    16     1     4     1    20    17     3    17 
## 14.45  14.5 14.55  14.6  14.7 14.75  14.8  14.9 14.95    15  15.1 15.15 
##     3    17     3    13    14     2    12    14     2    13     7     1 
##  15.2 15.25  15.3  15.4  15.5 15.55  15.6  15.7 15.75  15.8  15.9    16 
##     6     1     9    17    11     6    14     9     1     6     2    10 
## 16.05  16.1  16.2  16.3  16.4 16.45  16.5 16.55  16.6 16.65  16.7 16.75 
##     6     2     7     7     5     1     3     1     2     5     5     2 
##  16.8 16.85  16.9 16.95    17 17.05  17.1  17.2  17.3 17.35  17.4 17.45 
##     4     4     3     3     1     1     5     9    14     1     2     2 
##  17.5 17.55  17.6  17.7 17.75  17.8 17.85  17.9 17.95    18 18.05  18.1 
##     8     3     2     1     4    13     5     2     3     2     3     6 
## 18.15  18.2  18.3 18.35  18.4  18.5  18.6 18.75  18.8  18.9 18.95  19.1 
##     8     3     2     4     1     1     1     4     3     1     3     1 
## 19.25  19.3 19.35  19.4 19.45  19.5  19.6  19.8  19.9 19.95 20.15  20.2 
##     3     4     1     2     3     2     1     4     1     3     1     2 
##  20.3  20.4  20.7  20.8    22  22.6  23.5 26.05  31.6  65.8 
##     1     1     2     2     2     1     1     2     2     1
## 
##  0.9  1.2  1.3  1.4  1.5  1.6 1.65  1.7 1.75  1.8  1.9    2 2.05  2.1 2.15 
##    2    8    5   35   30   58    2   76    2  129  117  156    2  128    2 
##  2.2 2.25  2.3 2.35  2.4  2.5 2.55  2.6 2.65  2.7  2.8 2.85  2.9 2.95    3 
##  131    1  109    1   86   84    1   79    1   39   49    1   24    1   25 
##  3.1  3.2  3.3  3.4 3.45  3.5  3.6 3.65  3.7 3.75  3.8  3.9    4  4.1  4.2 
##    7   15   11   15    1    2    8    1    4    1    8    6   11    6    5 
## 4.25  4.3  4.4  4.5  4.6 4.65  4.7  4.8    5  5.1 5.15  5.2  5.4  5.5  5.6 
##    1    8    4    4    6    2    1    3    1    5    1    3    1    8    6 
##  5.7  5.8  5.9    6  6.1  6.2  6.3  6.4 6.55  6.6  6.7    7  7.2  7.3  7.5 
##    1    4    3    4    4    3    2    3    2    2    2    1    1    1    1 
##  7.8  7.9  8.1  8.3  8.6  8.8  8.9    9 10.7   11 12.9 13.4 13.8 13.9 15.4 
##    2    3    2    3    1    2    1    1    1    2    1    1    2    1    2 
## 15.5 
##    1

plot of chunk residual.sugar.sum

There is an outlier at around 65; the majority of values are between 0.6 and 21.


Observation about residual sugar of wine:

White wine’s residual.sugar mostly goes up to about 20, with a single outlier at 65.8, whereas red wine’s residual.sugar mostly stays below about 9 (maximum 15.5). So some of the white wines seem to be sweeter than the red wines. (http://wine.about.com/od/wineandhealth/qt/Which-Wine-Has-The-Least-Sugar.htm)

As we can see from the tabulated residual.sugar data for white and red wine, many wines in the sample are dry red and white wines. But some of the white wines seem to be off-dry wines, where the residual.sugar falls between 10 and 30 grams, and some have the sweetness of a champagne (6 - 20 grams).

Chloride level analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.009   0.038   0.047   0.056   0.065   0.611

Chloride levels seem to be skewed: even though the mean is 0.05603, the max value is 0.61100, so a log10 scale is used to analyze this further.

plot of chunk chloride.log10


Observation about Chlorides in wine:

A few white wines have notably low chloride levels. There are some outliers for red wine chloride levels.

Free SO2 analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    17.0    29.0    30.5    41.0   289.0

The free sulfur dioxide data seems to be skewed, so log10 is used to analyze it further.

plot of chunk free.SO2.log10


Observation about free sulfur dioxide in wine:

More white wines have higher levels of free sulfur dioxide. There are some outliers for white wine, up to 289.0.

This may be because of the following reason: (http://www.morethanorganic.com/sulphur-in-the-bottle) Red wines do not need any added sulphur dioxide because they naturally contain anti-oxidants, acquired from their skins and stems during fermentation. Conventional winemakers add some anyway. White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide.

(http://en.wikipedia.org/wiki/Winemaking)

Also, as seen earlier, many of our white wines were sweeter than the red wines. Sweet wines or off-dry wines are made by arresting fermentation before all sugar has been converted into ethanol and allowing some residual sugar to remain. This can be done by chilling the wine and adding sulphur and other allowable additives to inhibit yeast activity, or by sterile filtering the wine to remove all yeast and bacteria.

Total.SO2 analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6      77     118     116     156     440

The total sulfur dioxide data seems to be skewed, so log10 is used to analyze it further.

plot of chunk total.SO2.log10


Observation about total sulfur dioxide in wine:

More white wines have higher levels of total sulfur dioxide, just as with free sulfur dioxide. There are some outliers for white wine around 350.0.

Density Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.987   0.992   0.995   0.995   0.997   1.040

The density data seems to be skewed, so log10 is used to analyze it further.

plot of chunk density.log10


Observation about density in wine:

There is an outlier at 1.03911 and there are others between 1.00911 and 1.01111.

Otherwise, most wines’ densities range from 0.987 to 1.0031.

pH level analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.72    3.11    3.21    3.22    3.32    4.01

plot of chunk pH
Observation about pH in wine:

Most of the wines in our sample have a pH in the range of 3 to 3.5.

Level of Sulphates analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.220   0.430   0.510   0.531   0.600   2.000

plot of chunk sulphates


Observation about sulphates in wine:

There are some gaps in the data: either no data with those sulphate values was gathered, or wines don’t have those sulphate values.

Alcohol content analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0     9.5    10.3    10.5    11.3    14.9

plot of chunk alcohol
Observation about Alcohol in wine:

Both red and white wines have the same alcohol distribution pattern.

The peak is around 9.5 for both red and white wine.

Analysis of quality of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00

plot of chunk quality

##     bad average    good 
##     246    4974    1277

plot of chunk quality
Observation about Quality of wine:

The distribution of wine quality appears to be roughly normal in shape, with the peak at quality 5 and 6.

I also created a new variable, Quality Rating (quality_rating), which classifies the wines into Bad, Average and Good buckets based on the quality of the wine. The majority fell in the Average bucket.
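
The bucketing code is not shown in the report. One way to create such a variable is with cut(). In this sketch the cut points (below 5 is bad, 5 to 6 is average, 7 and above is good) are an assumption, and the quality scores are toy values.

```r
# Toy quality scores (illustrative only)
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Assumed cut points: (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf) -> good
quality_rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("bad", "average", "good"))

table(quality_rating)
```

Since cut() returns an ordered set of factor levels in the order given, tables and faceted plots list the buckets as bad, average, good rather than alphabetically.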

Univariate Analysis

Did you create any new variables from existing variables in the dataset?

Created a new variable quality_rating which classifies the wines into Bad, Average and Good buckets based on the quality of the wine.

Of the features you investigated, were there any unusual distributions?

The density distribution of white wine is bimodal, while that of red wine is approximately normal.

Did you perform any operations on the data to tidy, adjust, or change the form of the data?

I did not tidy the data, but to be able to analyze some of the skewed variables I had to use a log10 scale.

Bivariate Plots and Analysis Section

pairs(wine)

Reviewing the ggpairs output for strong correlations, we see several pairs that can be analyzed further. Two of them are set aside:

total.sulfur.dioxide vs. free.sulfur.dioxide (corr = 0.721): we can ignore this correlation because free SO2 is part of total SO2.
free.sulfur.dioxide vs. residual.sugar (corr = 0.403): since the correlation between total.sulfur.dioxide and residual.sugar is higher, we ignore this pair as well.

Ref.:http://www.inside-r.org/packages/cran/psych/docs/pairs.panels

plot of chunk correlation

A few of the top correlation pairs are:

alcohol vs. density (corr = -0.69)
density vs. residual.sugar (corr = 0.55)
total.sulfur.dioxide vs. residual.sugar (corr = 0.50)
density vs. fixed.acidity (corr = 0.46)
quality vs. alcohol (corr = 0.44)
total.sulfur.dioxide vs. volatile.acidity (corr = -0.41)
chlorides vs. sulphates (corr = 0.40)
chlorides vs. volatile.acidity (corr = 0.38)
citric.acid vs. fixed.acidity (corr = -0.38)
density vs. chlorides (corr = 0.36)
alcohol vs. residual.sugar (corr = -0.36)
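
A ranked list like the one above can also be extracted programmatically from the correlation matrix instead of being read off the panel plot. The sketch below assumes a hypothetical helper top_correlations and uses synthetic data generated for illustration, not the wine data.

```r
# Hypothetical helper: list variable pairs ordered by absolute correlation.
top_correlations <- function(df, n = 5) {
  cm  <- cor(df[sapply(df, is.numeric)])        # numeric columns only
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # each pair exactly once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    corr = cm[idx]
  )
  out <- out[order(-abs(out$corr)), ]
  head(out, n)
}

# Synthetic data: density is built to fall as alcohol rises; pH is independent
set.seed(1)
n       <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
toy     <- data.frame(alcohol, density, pH)

top <- top_correlations(toy, 3)
top
```

Sorting on the absolute value keeps strong negative pairs, such as alcohol vs. density, at the top alongside strong positive ones.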

Further analyzing the elements that affect wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00

Function to generate graphs to analyze different elements correlation with quality factor
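
The function itself does not appear in this report. Below is a rough sketch of what such a helper could look like using base graphics: a box plot per quality level with the group means overlaid and the overall mean as a dashed line. The name plot_by_quality, its arguments and the toy data are assumptions; the original most likely used ggplot2.

```r
# Hypothetical helper: box plot of one variable by quality, with group means.
plot_by_quality <- function(df, variable) {
  groups <- split(df[[variable]], df$quality)
  boxplot(groups, xlab = "quality", ylab = variable,
          main = paste(variable, "by quality"))
  means <- sapply(groups, mean)
  points(seq_along(means), means, pch = 18, col = "red")  # group means
  abline(h = mean(df[[variable]]), lty = 2)               # overall mean
  invisible(means)
}

# Toy data (illustrative only)
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.0, 11.0, 11.5, 12.5)
)

pdf(NULL)  # discard graphics output when run non-interactively
group_means <- plot_by_quality(toy, "alcohol")
invisible(dev.off())
group_means
```

Returning the group means invisibly lets the same call both draw the plot and feed a numeric table.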

Analysis of correlation between alcohol and quality

Quality of wine vs. alcohol is shown using box plots. Alcohol plays an important role in the microbial stabilization of both red and white wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0     9.5    10.3    10.5    11.3    14.9

To analyze the relationship between alcohol and quality, let us see how the alcohol values are distributed across the quality levels and how the mean alcohol varies with quality.

##      3      4      5      6      7      8      9 
## 10.215 10.180  9.838 10.588 11.386 11.679 12.180
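
A table of group means like the one above could be computed with tapply(). This sketch uses toy values for illustration and then flags which quality levels sit above the overall mean alcohol.

```r
# Toy data (illustrative only)
wine <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7, 5, 6),
  alcohol = c(9.4, 9.8, 10.0, 11.0, 11.5, 12.5, 9.2, 10.6)
)

# Mean alcohol for each quality level
mean_by_quality <- tapply(wine$alcohol, wine$quality, mean)
round(mean_by_quality, 3)

# Which quality levels have above-average alcohol?
above_avg <- mean_by_quality > mean(wine$alcohol)
above_avg
```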

Visually alcohol by quality levels along with median and mean is:

plot of chunk bivar_alcohol_visual

Observation about Alcohol vs. Quality of Wine:

Both red and white wines rated above the mean quality value of 5.818 tend to have alcohol above the mean alcohol value of 10.49.

In our sample only some white wines have the highest quality of 9.

Analysis of correlation between residual.sugar and quality

Quality of wine vs. residual sugar is displayed using box plots, as sugar is an essential component in the production of wine.

During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine.

Ref: https://winemakermag.com/501-measuring-residual-sugar-techniques

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.60    1.80    3.00    5.44    8.10   65.80

In order to analyze the relationship between residual.sugar and quality, let us see how the residual.sugar values are distributed across varying quality and how it varies with quality.

##     3     4     5     6     7     8     9 
## 5.140 4.154 5.804 5.550 4.732 5.383 4.120

Visually residual.sugar by quality levels along with median and mean is:

plot of chunk bivar_rs_visual

Observation about residual.sugar vs. Quality of Wine:

Red wine quality does not appear to be impacted by residual.sugar, and red wine has less residual.sugar.

White wine of highest quality of 9 has residual.sugar less than the mean residual.sugar value.

White wine has higher residual.sugar than red wine.

Interesting fact: A winemaker who wishes to make a wine with high levels of residual sugar (like a dessert wine) may stop fermentation early, either by dropping the temperature of the must to stun the yeast or by adding a high level of alcohol (like brandy) to the must to kill off the yeast and create a fortified wine.
Ref.: http://en.wikipedia.org/wiki/Fermentation_in_winemaking

Analysis of correlation between Chloride levels and quality

Quality of wine vs. chlorides is analyzed next. Chlorides act as preserving agents in liquid enzyme preparations, which in turn are important for the microbiological stability of wines.
Ref.: http://www.westchesterwinemakers.com/2010/06/03/enzymes-in-winemaking-do-we-use-them-damm-straight-we-do/

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.009   0.038   0.047   0.056   0.065   0.611

In order to analyze the relationship between chlorides and quality, let us see how the chloride values are distributed across varying quality and how it varies with quality.

##       3       4       5       6       7       8       9 
## 0.07703 0.06006 0.06467 0.05416 0.04527 0.04112 0.02740

Visually chlorides by quality levels along with median and mean are:

plot of chunk bivar_cl_visual

Observation about Chlorides vs. Quality of Wine:

For both red and white wine, lower chloride levels are associated with higher quality.

Red wine has more chloride content than white wine; white wine’s chloride content is below the mean chloride level.

Analysis of correlation between density of wine and quality

Quality of wine vs. density is shown using box plots.

Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.

https://answers.yahoo.com/question/index?qid=20140527020443AALJISW

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.987   0.992   0.995   0.995   0.997   1.040

To analyze the relationship between density and quality, let us see how the density values are distributed across the quality levels and how the mean density varies with quality.

##      3      4      5      6      7      8      9 
## 0.9957 0.9948 0.9958 0.9946 0.9931 0.9925 0.9915

Visually density by quality levels along with median and mean is:

plot of chunk bivar_dens_visual

Observation about Density vs. Quality of Wine:

For both red and white wine, lower density is associated with higher quality.

Red wine is denser than white wine.

In our sample, many white wines fall in the quality range between 4.5 and 7.5; only a few have a high quality of 8.

In our sample of red wines, the majority are between quality 4.5 and 6.5; only some are at quality level 7 and very few at 8.

plot of chunk bivar_so2_quality

plot of chunk bivar_fixed.acidity_quality

plot of chunk bivar_sulphates_quality

As the SO2 vs. quality, sulphates vs. quality and fixed.acidity vs. quality graphs show, the quality of wine varies from 4.5 to 7.5 for both red and white wine irrespective of the SO2, sulphates or fixed.acidity level.

Very few white wines are of high quality, and these elements seem to have no impact on quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol correlates with wine quality more strongly than the other variables examined (corr = 0.44): as alcohol content increases, wine quality tends to increase.

Red wine quality does not appear to be impacted by residual.sugar, and red wine has less residual.sugar. White wine of the highest quality, 9, has residual.sugar below the mean residual.sugar value.

White wine has higher residual.sugar than red wine.

For both red and white wine, lower chloride levels are associated with higher quality.

For both red and white wine, lower density is associated with higher quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between some elements varies with the color of the wine:

density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.

What was the strongest relationship you found?

Among the relationships with quality, alcohol vs. quality is the strongest I found for both wines in the given data.

Multi-variate Plots and Analysis Section

The following pairs of variables are plotted against each other and faceted by wine quality_rating:

plot of chunk density_alcohol

The correlation between alcohol and density is strong and negative for both white and red wines.

As the alcohol level increases, the density of wine decreases.

plot of chunk residual.sugar_density

The correlation between residual.sugar and density is strong and positive for white and red wines.

As residual.sugar increases, density also increases, as we can see from the average and good quality red and white wines.

The density of wine is close to that of water; dry wine is slightly less dense and sweet wine is slightly denser.

Water has a density of 1.000 kg/L
Ethanol has a density of 0.789 kg/L
Sugar has a density of 1.587 kg/L

So wine with 13% alcohol by volume and 0.5% sugar by volume has a density of

0.13 x 789 + 0.005 x 1587 + 0.865 x 1000 = 975.5 g/L (about 0.9755 kg/L)

(Ref.:http://www.answers.com/Q/What_is_the_density_of_wine)
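
The calculation above is a volume-weighted average of the three component densities. It can be sketched as a small function, using the densities quoted above in kg/L. The function name mix_density is an assumption for illustration, and the simple average ignores the slight volume contraction that occurs when ethanol and water are mixed.

```r
# Hypothetical helper: volume-weighted density of a water/ethanol/sugar mix.
# Fractions are by volume; densities are in kg/L as quoted in the text.
mix_density <- function(alcohol_frac, sugar_frac,
                        rho_ethanol = 0.789, rho_sugar = 1.587,
                        rho_water = 1.000) {
  water_frac <- 1 - alcohol_frac - sugar_frac
  alcohol_frac * rho_ethanol + sugar_frac * rho_sugar + water_frac * rho_water
}

# 13% alcohol and 0.5% sugar by volume, as in the worked example
d_example <- mix_density(0.13, 0.005)
d_example          # about 0.9755 kg/L, i.e. 975.5 g/L

# More alcohol lowers the density; more sugar raises it
mix_density(0.14, 0.005) < d_example
mix_density(0.13, 0.050) > d_example
```

The two comparisons mirror the plots above: density falls as alcohol rises and climbs as residual sugar rises.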

plot of chunk residual.sugar_total.sulfur.dioxide

The correlation between residual.sugar and total.sulfur.dioxide is weak for white and red wine.

plot of chunk density_fixed.acidity

The correlation between density and fixed.acidity is strong and positive for red wine and none for white wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.82    6.00    9.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6      77     118     116     156     440

plot of chunk total.sulfur.dioxide_volatile.acidity

There is no correlation between volatile.acidity and total.sulfur.dioxide for red and white wines.

plot of chunk chlorides_sulphates

The correlation between chlorides and sulphates is strong and positive for red and none for white wines.

plot of chunk chlorides_volatile.acidity

There is no correlation between chlorides and volatile.acidity for red and white wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.250   0.310   0.319   0.390   1.660

plot of chunk citric.acid_fixed.acidity

The correlation between fixed.acidity and citric.acid is strong for red wines; for white wines it weakens as the quality rating goes from bad to good.

plot of chunk chlorides_density

The correlation between chlorides and density is strong for red and white wines.

plot of chunk residual.sugar_alcohol

The correlation between alcohol and residual.sugar is strong for white wines and weak to none for red wines.

During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine. (Ref.:https://winemakermag.com/501-measuring-residual-sugar-techniques)

So as residual.sugar level increases alcohol level decreases for white wine.

So in summary

(S = strong, W = weak, N = none; Corr is the magnitude of the overall correlation)

Element pairs                               Red   White   Corr
alcohol vs. density                         S     S       0.69
residual.sugar vs. density                  S     S       0.55
residual.sugar vs. total.sulfur.dioxide     W     W       0.50
density vs. fixed.acidity                   S     N       0.46
quality vs. alcohol                         S     S       0.44
volatile.acidity vs. total.sulfur.dioxide   N     N       0.41
chlorides vs. sulphates                     S     N       0.40
volatile.acidity vs. chlorides              N     N       0.38
fixed.acidity vs. citric.acid               S     W       0.38
chlorides vs. density                       S     S       0.36
residual.sugar vs. alcohol                  N     S       0.36

From the above it is evident that the following correlations depend on the color of the wine:
density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.

Since the number of red wines is about one third of the number of white wines in the sample, the correlations between the elements of the combined sample follow the white wines rather than the red.
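
This effect can be illustrated with synthetic data. In the sketch below the smaller "red" group has a strong relationship and the larger "white" group has none, and the pooled correlation ends up well below the red one. All values are simulated for illustration, and the 1:3 group sizes only mimic the sample's proportions.

```r
set.seed(42)
n_red   <- 100
n_white <- 300

# Red: y follows x closely; white: x and y are unrelated
red_x   <- rnorm(n_red)
red_y   <- red_x + rnorm(n_red, sd = 0.3)
white_x <- rnorm(n_white)
white_y <- rnorm(n_white)

toy <- data.frame(
  x     = c(red_x, white_x),
  y     = c(red_y, white_y),
  color = rep(c("red", "white"), times = c(n_red, n_white))
)

# Correlation on the pooled sample vs. within each color
pooled    <- cor(toy$x, toy$y)
per_color <- sapply(split(toy, toy$color), function(d) cor(d$x, d$y))

pooled
per_color
```

This is why the red wines are examined on their own below: a pooled coefficient can hide a relationship that is strong in the smaller group.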

So below we analyze some of the key correlations for red wine on its own.

plot of chunk red_panel

In the case of red wine, the top correlations are between the following elements:

Element pairs                       Corr
fixed.acidity vs. pH                -0.68
fixed.acidity vs. citric.acid        0.67
fixed.acidity vs. density            0.67
volatile.acidity vs. citric.acid    -0.55
citric.acid vs. pH                  -0.54
density vs. alcohol                 -0.50
## 
##     bad average    good 
##      63    1319     217

plot of chunk r_fixed.acidity_pH

As you can see, the pH level decreases as acidity increases. The correlation between pH and fixed.acidity is negative, but it does not show a clear relationship to quality.

Now let us look at the correlation between pH and fixed.acidity for good and bad quality wine only.

plot of chunk r_sub_fixed.acidity_pH

Still no clear relationship to quality.

plot of chunk r_fixed.acidity_citric.acid

Wines of quality level 6 are concentrated at fixed.acidity between 6 and 10 and citric.acid between 0 and 0.37.

Further analyzing the correlation for bad and good quality wine:

plot of chunk r_fixed.acidity_citric.acid_bg

As fixed.acidity increases, the citric.acid level in red wine also increases, maybe because citric.acid is a form of titratable acidity (i.e. fixed.acidity).

Quality level 7 has a higher content of citric.acid, indicating that higher quality red wines have more citric.acid in them.

plot of chunk r_fixed.acidity_density

The quality of red wine increases as fixed.acidity and density increase.

plot of chunk r_volatile.acidity_citric.acid

The correlation between volatile.acidity and citric.acid is negative; that is, as volatile.acidity increases, the citric.acid of red wine decreases.

The majority of the wines with high levels of citric.acid are at quality level 7, and those with lower levels fall in the quality level 5 range.

This supports the earlier observation that the level of citric.acid in red wine contributes to its quality.

While fixed.acidity seems to have a positive impact on wine quality, volatile.acidity seems to have a negative impact.

plot of chunk r_citric.acid_pH

pH and citric.acid together seem to be related to the quality of red wine.

Most of the good quality wines have a pH between 3 and 3.5, and citric.acid levels are higher in good quality wine.

plot of chunk r_density_alcohol

The majority of red wines with a quality level of 7 have an alcohol content above 10%.

Quality of red wine is good when

Fixed.acidity is less
Citric.acid is high
Alcohol is high
pH is between 3 and 3.5

plot of chunk white_panel

In the case of white wine, the top correlations are between the following elements:

Element pairs                       Corr
residual.sugar vs. density           0.84
density vs. alcohol                 -0.78
total.sulfur.dioxide vs. density     0.53
residual.sugar vs. alcohol          -0.45
total.sulfur.dioxide vs. alcohol    -0.45
pH vs. fixed.acidity                -0.43
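A "top correlations" listing of this kind can be sketched in Python by computing the coefficient for every pair of columns and sorting by absolute value. The column values below are hypothetical numbers made up for illustration, not the wine measurements, and the `pearson` helper is an illustrative stand-in for whatever routine produced the table.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements for illustration only (not the wine data).
columns = {
    "density": [1.0, 2.0, 3.0, 4.0, 5.0],
    "residual.sugar": [1.1, 1.9, 3.2, 3.9, 5.0],
    "alcohol": [5.0, 4.2, 2.9, 2.1, 1.0],
    "pH": [3.0, 1.0, 2.0, 1.0, 3.0],
}

# One (name, name, r) entry per unordered pair of columns.
pairs = [
    (a, b, pearson(columns[a], columns[b]))
    for a, b in combinations(columns, 2)
]

# Strongest relationships first, regardless of sign.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for a, b, r in ranked:
    print(f"{a} vs. {b}: {r:+.2f}")
```

Sorting on the absolute value keeps strong negative pairs, such as density vs. alcohol, next to strong positive ones instead of pushing them to the bottom of the list.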

plot of chunk white_residual.sugar_density

White wine quality is high when the density of the wine is low.

plot of chunk w_density_alcohol

White wine quality is high when alcohol is high, and the correlation between alcohol and density is negative. This again confirms the above finding about density.

## 
##    10    19    25    26    29    31    33    34    37    44    47    49 
##     1     1     1     1     1     1     1     2     2     1     1     1 
##    50    51    53    55    56    57    59    60    61    62    64    65 
##     2     1     1     3     2     2     2     1     3     1     1     5 
##    66    67    68    69    70    71    72    73    74    75    76    77 
##     2     2     5     3     3     3     5     7     3     7    11     2 
##    78    79    80    81    82    83    84    85    86    87    88    89 
##     5     6     6     7     7    10     5     5    10    13     9    12 
##    90    91    92    93    94    95    96    97    98    99   100   101 
##     9     3     9    18    10    11    13    16    22     8    14    10 
##   102   103   104   105   106   107   108   109   110   111   112   113 
##    20     6    13    14    11    19     8     7    11    22     7    24 
##   114   115 115.5   116   117   118   119   120   121   122   123   124 
##    25    16     1    10    13    20    13    12    15    17     8    17 
##   125   126   127   128   129 129.5   130   131   132   133   134   135 
##    13    13    15    19    11     2    15     9    14    12    18    14 
##   136   137   138   139   140   141   142   143   144   145   146   147 
##    10     3    18     4    15     6    15    12     7     5     4     8 
##   148   149   150   151   152   153   154   155   156   157   158   159 
##    10    15    15    10    12     8     4    12     5     6     9     2 
##   160   161   162   163   164   165   166   167   168   169   170   171 
##     3     5     8     7     6     4     5     7    13     4     5    10 
##   172   173   174   175   177   178   179   180   181   182   184   185 
##     4     4     3     3     5     9     4     4     5     2     1     1 
##   186   187   188   189 189.5   190   191   192   193   194   195   196 
##     4     3     2    10     1     1     4     5     3     1     3     2 
##   197   198   199   200   201   203   205   206   208   209   210   212 
##     2     1     2     7     2     3     4     1     1     1     2     3 
## 212.5   214   216   225   227   228   229   233 234.5   245   272 307.5 
##     6     2     1     1     1     1     6     1     1     1     1     1 
## 366.5   440 
##     1     1

plot of chunk w_totalso2_dens

The majority of white wines with low density and total.sulfur.dioxide between 75 mg/L and 175 mg/L seem to have high quality.

White wines have more total SO2 than red wines, and very sweet wines such as dessert and fortified wines need more SO2 still (Ref.: http://www.practicalwinery.com/janfeb09/page5.htm).

plot of chunk w_residual.sugar_alcohol

White wine quality is higher when residual.sugar is below 10 and alcohol content is high. Further analyzing for only bad and good quality white wines:

plot of chunk w_residual.sugar_alcohol_bg

plot of chunk w_total.so2_alcohol

The quality of white wine is high when total.sulfur.dioxide is below 250 and alcohol content is high.

plot of chunk w_pH_fixed.acidity

The correlation of pH vs. fixed.acidity in relation to quality is inconclusive. All we can say is that good quality wines have a pH between 3 and 3.5 and a fixed.acidity level between 5 and 8.

Quality of white wine is good when

Density is low
Residual.sugar is below 10 g/L
Alcohol content is high
Total.sulfur.dioxide is between 75 mg/L and 175 mg/L

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between alcohol and density is strong and negative, and higher alcohol together with lower density goes with higher wine quality.

In the case of white wine, the strongest correlation (positive) is between residual.sugar and density.

In the case of red wine, the strongest correlation (negative) is between fixed.acidity and pH.

Were there any interesting or surprising interactions between features?

The correlation between some of the elements depended on the color of the wine.

Final Plots and Summary

Plot One

plot of chunk Plot1

Description One:

Wort /ˈwɜrt/ is the liquid extracted from the mashing process during the brewing of beer or whisky. Wort contains the sugars that will be fermented by the brewing yeast to produce alcohol.

The density of a wort is largely dependent on the sugar content of the wort. During alcohol fermentation, yeast converts sugars into carbon dioxide and alcohol. The decline in the sugar content and the presence of ethanol (which is appreciably less dense than water) drop the density of the wort.

Ref.:http://en.wikipedia.org/wiki/Gravity_%28alcoholic_beverage%29

Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.
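As a rough illustration of how the density difference maps to alcohol content, the Python sketch below uses a common homebrewing rule of thumb, ABV ≈ (OG − FG) × 131.25, where OG and FG are the specific gravity before and after fermentation. This approximation, the function name, and the example readings are all assumptions added for illustration; none of them come from the wine data set or from the references cited in this report.

```python
def abv_from_gravity(original_gravity, final_gravity):
    """Approximate alcohol by volume (%) from the drop in specific gravity.

    Uses the rule of thumb ABV ~ (OG - FG) * 131.25, which is an assumption
    here and not a formula taken from this report or its data.
    """
    return (original_gravity - final_gravity) * 131.25

# Hypothetical readings: must at 1.090, finished wine at 0.995.
abv = abv_from_gravity(1.090, 0.995)
print(f"{abv:.1f}% ABV")
```

For these made-up readings the estimate is about 12.5% alcohol by volume. The larger the drop in density, the higher the estimated alcohol, which is the same direction as the negative density vs. alcohol correlation seen in the sample.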

White wine seems to have a lower density than red wine.

So from the graphs it is evident that wines with low density tend to have high quality.

Below we look at the impact of alcohol level on wine quality.

Plot Two

plot of chunk Plot2

Description Two

The quality of wine (both red and white) seems to increase as the level of alcohol increases. In our sample, every quality level except level 5 seems to support that theory.

The graph below shows the correlation between alcohol and density for the red and white wines in our sample.

Plot Three

plot of chunk plot3

So for both red and white wine, alcohol and density have a strong negative correlation, i.e. as the alcohol level increases the density of the wine decreases.

For the combined sample this correlation is -0.69.

Now the impact of density and alcohol on the quality_rating of wine can be depicted as follows:

plot of chunk Plot3_corr_anal

## 
##     bad average    good 
##     246    4974    1277

Description Three

Even though our graph and the data indicate that higher alcohol content and lower density contribute to a good quality wine, the correlation between quality and alcohol doesn't seem to be that strong (0.44).

To analyze that further, I plotted the correlation using quality_rating, which showed that lower density and higher alcohol content in wines go together with higher wine quality.

The table of quality_rating shows the reason for the weaker correlation: the majority of our wine sample falls under (4,6], which is the average quality bucket.

The link below gives the five key components of wine.
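The bucketing behind quality_rating can be sketched in Python as below. Only the (4,6] average bucket is stated in the text; treating scores of 4 or less as bad and scores above 6 as good is an assumption, and the function is an illustrative stand-in rather than the code used for this report. The counts in `reported` are the ones printed in the quality_rating table above.

```python
from collections import Counter

def quality_rating(score):
    """Bucket a 0-10 quality score: bad (<= 4), average (4, 6], good (> 6).

    Only the (4, 6] bucket is given in the report; the other two cutoffs
    are assumed.
    """
    if score <= 4:
        return "bad"
    if score <= 6:
        return "average"
    return "good"

# Hypothetical scores, for illustration only (not the wine data).
scores = [3, 4, 5, 5, 5, 6, 6, 6, 7, 8]
counts = Counter(quality_rating(s) for s in scores)
print(dict(counts))

# Counts printed in the report's quality_rating table (red + white).
reported = {"bad": 246, "average": 4974, "good": 1277}
average_share = reported["average"] / sum(reported.values())
print(f"share of average wines: {average_share:.1%}")
```

With roughly three quarters of the wines in the average bucket, quality has little spread for alcohol to explain, which is consistent with the modest 0.44 coefficient.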

http://www.snooth.com/articles/five-key-wine-components-and-how-to-detect-them/?viewall=1

Reflection

The wine data set contains information on both red and white wine. I started by understanding the individual variables in the data set by plotting graphs and by visiting websites to see what contribution each element makes.

Then I explored interesting questions and leads as I continued to make observations on the plots. Eventually, I explored the quality of wine based on density and alcohol.

It is interesting that even though the graph shows that an increase in alcohol content is an indication of good quality wine, the correlation between quality and alcohol is not strong.

On further analysis I realized that the majority of the sample falls in the 4 to 6 quality range (which is average), and hence the correlation may not be a true reflection.

Our sample has 250 mg/l for white wines with residual sugar greater than 5 g/litre (Moelleux wines), and 300 mg/l for liquoreux sweet wines (Ref.:http://en.wikipedia.org/wiki/White_wine)

According to http://winemakersacademy.com/importance-ph-wine-making/, pH is the backbone of any wine. Even though the data shows that the wines in our sample have a pH in the range of 3 to 3.5, pH does not have a strong relation to the quality of wine, which was surprising to me.

The sample data positively reinforces the characteristics of the components in a white wine.

White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide.

Also, many of our white wines were sweeter than the red wines. Sweet or off-dry wines are made by arresting fermentation before all the sugar has been converted into ethanol, allowing some residual sugar to remain. This can be done by chilling the wine and adding sulphur and other allowable additives to inhibit yeast activity, or by sterile filtering the wine to remove all yeast and bacteria.

For further analysis

The data should include more red wine samples so that the analysis does not favor the characteristics of one wine over the other.

It should also have additional fields for:

Type of wine, such as dry, off-dry, fortified, or sparkling. In the current analysis I had to assume the type of each wine from the data and assess quality on that basis. Since certain types of wine should have components at certain levels, my analysis may not have been accurate.

Color of wine (as the color changes as wine ages).

Issues faced during the project:

The main issue I had was understanding the components of wine, not from the given data but from the real process of wine making, so I read some articles to obtain that understanding. Some of the components that those articles describe as important in wine making were not reflected in the given sample. I therefore analyzed them separately, and then realized that the correlations of some of the components were color (red or white) dependent.